Normalizing Microtext
Abstract
The use of computer-mediated communication has resulted in a new form of written text, microtext, which is very different from well-written text. Tweets and SMS messages, which have limited length and may contain misspellings, slang, or abbreviations, are two typical examples of microtext. Microtext poses new challenges to standard natural language processing tools, which are usually designed for well-written text. The objective of this work is to normalize microtext in order to produce text that is suitable for further treatment. We propose a normalization approach based on the source channel model, which incorporates four factors, namely an orthographic factor, a phonetic factor, a contextual factor and acronym expansion. Experiments show that our approach can normalize Twitter messages reasonably well, and that it outperforms existing algorithms on a public SMS data set.

Introduction

The Web has become a channel through which people communicate, write about their lives and interests, give opinions and rate products. In some contexts, users, especially young users, write in an informal way without minding spelling or grammar, even deliberately shortening words or using slang. Words or phrases such as "lol" (laugh out loud), "c u 2nite" (see you tonight) and "plz" (please), which may not be found in standard English, are widely used by Web users. This casual usage of language results in a new form of written text which is very different from well-written text. Such chat-speak-style text is especially prevalent in Short Message Service (SMS) messages, chat rooms and micro-blogs, and is referred to as microtext by Ellen (2011). In this work, Tweets and SMS messages are explored as typical examples of microtext.

There has been a large effort to produce natural language processing (NLP) algorithms and tools that try to understand well-written text, but these tools cannot be applied out of the box to analyze microtext, which usually contains noise, including misspellings, abbreviations and improper grammar. Tools such as Named Entity Recognition have been reported to perform substantially worse on Tweets than on structured text, partly due to the high amount of noise present in Tweets (Murnane 2010; Corvey et al. 2010). In order to effectively retrieve or mine data from microtext, it is necessary to normalize it, so that it is more readable for machines and humans and becomes suitable for further treatment using standard NLP tools.

The task of microtext normalization addressed in this work has many similarities to traditional spelling correction but also poses additional challenges. Both the frequency and severity of spelling errors in microtext are significantly greater than in normal text: in our datasets, a significant fraction of microtext (31% of Twitter messages, 92% of SMS messages) requires normalization. In microtext, words are sometimes misspelled due to typographic errors, such as "positon" for "position". In other cases, such Non-Standard Words (NSWs), words which are written differently from their standard forms, are used intentionally for various reasons, including phonetic spelling ("enuf" for "enough"), emotional emphasis ("goooood" for "good"), popular acronyms ("asap" for "as soon as possible"), etc. Some NSWs are so widely used that they are more recognizable than their original standard forms, such as ATM, CD, PhD and BMW.
Typically, the validity of a microtext term cannot be decided by lexicon look-up or by checking its grammaticality. Therefore, general-purpose spelling correction methods which rely on lexicon look-up are not able to solve the problem of normalizing microtext. In this paper, we address the problem of normalizing microtext using the source channel model. The problem of normalizing microtext is viewed as that of recovering the intended word or word sequence given an observed word, which may or may not be an NSW. Our model takes four factors into consideration, namely character string-based typographical similarity, phonetic similarity, contextual similarity and popular acronyms.

Related Work

The following sections compare microtext normalization with similar and related applications.

General Text Normalization

General text normalization has been well studied in text-to-speech (Sproat et al. 2001). General text normalization deals with tokens such as numbers, abbreviations, dates, currency amounts and acronyms, which are normative, while microtext normalization needs to deal with lingo such as "l8" (late) and "msg" (message), which is typically self-created and not yet formalized in linguistics. Moreover, microtext is brief, containing a very limited number of characters (140 characters for Tweets, and 160 characters for SMS messages). It is thus much more difficult to use contextual or grammatical information.

Spelling Correction

Short message normalization has many similarities with traditional spelling correction, which has a long history. There are two branches of research in conventional spelling correction, dealing with non-word errors and real-word errors respectively. Non-word correction focuses on generating and ranking a list of possible spelling corrections for each word not found in a spelling lexicon. The ranking process usually adopts some lexical-similarity measure between the misspelled string and the candidates, considering different edit operations (Damerau 1964; Levenshtein 1966), or a probabilistic estimation of the likelihood of the correction candidates (Brill and Moore 2000; Toutanova and Moore 2002); a minimal sketch of such an edit-distance measure appears below. Real-word spelling correction is also referred to as context-sensitive spelling correction; it tries to detect incorrect usage of valid words in certain contexts (Golding and Roth 1996; Mangu and Brill 1997).

Spelling correction algorithms rely on spelling lexicons to detect misspelled tokens, which is an unrealistic approach for microtext normalization. Short messages often contain valid words that are not found in any traditional lexicon (e.g., iTune, Picasa), and may even contain in-lexicon words that are actually intended to be other legitimate words ("I don't no" for "I don't know", and "agree wit u" for "agree with you"). On the open Web, we clearly cannot construct a static trusted lexicon, as many new names and concepts become popular every day and it would be difficult to maintain a high-coverage lexicon. Therefore, the models used in spelling correction are inadequate for microtext normalization.
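As a concrete illustration of the edit-operation measures cited above, the following is a minimal sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance and a candidate ranking built on it. It is a generic textbook implementation, not the ranking model of any of the cited systems; the lexicon and the rank_candidates helper are hypothetical.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    insertions, deletions, substitutions and adjacent transpositions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def rank_candidates(token: str, lexicon: list[str], k: int = 3) -> list[str]:
    """Rank lexicon entries by edit distance to the observed token."""
    return sorted(lexicon, key=lambda w: damerau_levenshtein(token, w))[:k]

# "positon" (the typo from the introduction) is one edit from "position".
print(rank_candidates("positon", ["position", "positron", "portion", "piston"]))
# -> ['position', 'positron', 'portion']
```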
SMS Normalization

While there are few attempts to normalize Twitter messages or microtext in general, there is some work on SMS normalization. Text messages, also called Short Message Service (SMS) texts, are similar to Tweets in the length of the messages and the devices used to generate them. Text message normalization has been handled through three well-known NLP metaphors: spelling correction, machine translation and automatic speech recognition. The spelling correction metaphor (Choudhury et al. 2007; Cook and Stevenson 2009) performs the normalization task on a word-per-word basis by applying the noisy channel model. The machine translation metaphor, first proposed by Aw et al. (2006), considers the process of normalizing text messages as a statistical phrase-based machine translation task from a source language (the text message) to a target language (its standard written form). Kobus et al. (2008) proposed to handle text message normalization through an automatic speech recognition metaphor, based on the observation that text messages contain many phonetic spellings. Beaufort et al. (2010) took a hybrid approach which combines spelling correction and machine translation methods. Our generic approach, detailed below, incorporates components from both spelling correction and automatic speech recognition.

Normalization Model

Given a piece of microtext, our model normalizes terms one by one. Thus, the challenge is to determine the corresponding standard form $t$, which can be a term or a sequence of terms, for each observed term $t'$. Notice that $t$ may or may not be the same as $t'$; $t'$ is an NSW if $t' \neq t$. Thus, the task is to find the most probable normalization $t^*$ for each observed term $t'$:

$$t^* = \arg\max_{t} P(t \mid t') = \arg\max_{t} P(t' \mid t)\, P(t) \qquad (1)$$

$P(t' \mid t)$ models the noisy channel through which an intended term $t$ is sent and corrupted into the observed term $t'$. $P(t)$ models the source that generates the intended term $t$. In practice, the source model can be approximated with an n-gram statistical language model.

Since a non-standard term and its corresponding standard form might be similar from one or multiple perspectives, it is reasonable to assume that there are several channels, each of which would distort an intended term in one of the above aspects. More specifically, a grapheme channel would be responsible for spelling distortion, a phoneme channel would cause phonetic corruption, a context channel may change terms around a target term, and an acronym channel may shrink a sequence of terms into one term. In reality, an intended word may be transferred through one or multiple channels, and an observed term might be a mixed corruption from more than one channel. Under this assumption, and letting $\{c_k \mid c_k \in C\}$ denote the set of channels, Equation 1 can be further developed, by marginalizing the channel model over $C$, into

$$t^* = \arg\max_{t} \sum_{c_k \in C} P(t' \mid t, c_k)\, P(c_k \mid t)\, P(t)$$
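To make the multi-channel combination concrete, the following is a minimal illustrative sketch, not the paper's implementation. Every channel likelihood here is a toy stand-in (string similarity for the grapheme channel, a crude consonant skeleton for the phoneme channel, a hand-written table for the acronym channel), the channel priors are fixed weights, and the source model is a smoothed unigram count rather than a full n-gram language model; all names, weights and data are hypothetical.

```python
import difflib

# Toy stand-ins for trained channel likelihoods P(t'|t, c_k).
def grapheme_channel(observed: str, intended: str) -> float:
    """Spelling distortion: character-string similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, observed, intended).ratio()

def phoneme_channel(observed: str, intended: str) -> float:
    """Phonetic corruption: crude consonant-skeleton match as a stand-in
    for a real grapheme-to-phoneme comparison."""
    def skeleton(s: str) -> str:
        return "".join(ch for ch in s.lower() if ch not in "aeiou")
    return 1.0 if skeleton(observed) == skeleton(intended) else 0.1

ACRONYMS = {"asap": "as soon as possible", "lol": "laugh out loud"}  # toy table

def acronym_channel(observed: str, intended: str) -> float:
    """Acronym expansion: one observed term maps to a term sequence."""
    return 1.0 if ACRONYMS.get(observed.lower()) == intended else 1e-6

# Fixed weights standing in for the channel priors P(c_k | t).
CHANNELS = [(grapheme_channel, 0.5), (phoneme_channel, 0.3), (acronym_channel, 0.2)]

def source_prior(intended: str, counts: dict[str, int], total: int) -> float:
    """Add-one-smoothed unigram stand-in for the n-gram source model P(t)."""
    return (counts.get(intended, 0) + 1) / (total + len(counts) + 1)

def normalize(observed: str, candidates: list[str],
              counts: dict[str, int], total: int) -> str:
    """Score each candidate t with the channel-sum form of Equation 1:
    sum over channels of P(t'|t, c_k) P(c_k|t), times the prior P(t)."""
    def score(t: str) -> float:
        channel_mix = sum(w * ch(observed, t) for ch, w in CHANNELS)
        return channel_mix * source_prior(t, counts, total)
    return max(candidates, key=score)

# Example: "asap" wins through the acronym channel, "enuf" through the
# grapheme channel plus the source prior.
counts = {"enough": 120, "as soon as possible": 15, "enforce": 10}
print(normalize("asap", ["as soon as possible", "a sap"], counts, total=1000))
print(normalize("enuf", ["enough", "enforce"], counts, total=1000))
```

A trained system would replace these stand-ins with learned distributions and would search over candidate sequences rather than a supplied candidate list; the sketch only shows how the per-channel scores combine under the summation.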
Similar Resources
Chinese Informal Word Normalization: an Experimental Study
We study the linguistic phenomenon of informal words in the domain of Chinese microtext and present a novel method for normalizing Chinese informal words to their formal equivalents. We formalize the task as a classification problem and propose rule-based and statistical features to model three plausible channels that explain the connection between formal and informal pairs. Our two-stage selec...
Clustering Microtext Streams for Event Identification
The popularity of microblogging systems has resulted in a new form of Web data – microtext – which is very different from conventional well-written text. Microtext often has the characteristics of informality, brevity, and varied grammar, which poses new challenges in applying traditional clustering algorithms to analyze microtext. In this paper, we propose a novel two-phase approach for cluste...
Contrasting Machine Learning Approaches for Microtext Classification
The goal is classification of microtext: classifying lines of military chat, or posts, which contain items of interest. This paper evaluates non-linear statistical data modeling techniques, and compares them with our previous results using several text categorization and feature selection methodologies. The chat posts are examples of 'microtext', or text that is generally very short in length, semi-...
A CCG-Based Approach to Fine-Grained Sentiment Analysis in Microtext
In this paper, we present a Combinatory Categorial Grammar (CCG) based approach to the classification of emotion in microtext. We develop a method that makes use of the notion put forward by Ortony, Clore, and Collins (1988), that emotions are valenced reactions. This hypothesis sits central to our system, in which we adapt contextual valence shifters to infer the emotional content of a text. W...
Learning Ontologies from the Web for Microtext Processing
We build a mechanism to form an ontology of entities which improves the relevance of matching and searching microtext. Ontology construction starts from the seed entities and mines the web for new entities associated with them. To form these new entities, machine learning of syntactic parse trees (syntactic generalization) is applied to form commonalities between various search results for existi...